Boston fork: security hardening, upstream-portal protection, AWS hosting docs#1

Merged
brendanbabb merged 12 commits into main from
boston/security-hardening-and-aws-docs on Apr 30, 2026

Conversation


brendanbabb commented Apr 16, 2026

Summary

Two phases of work, all on this branch:

Phase 1 — Infra/security/docs (original PR scope)

  1. Infra — us-west-2 region, reserved Lambda concurrency as a hard cap on upstream fan-out, cp311/manylinux Lambda packaging, S3 backend config extracted out of the public repo.
  2. Security — allowlist-only AST-aware SQL validator + aggregate_data SQL builder; 64 KB request body cap with 413 before JSON parsing.
  3. Docs — docs/AWS_DEPLOYMENT.md, docs/SECURITY.md, stdio_bridge.py, CLAUDE.md.
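
A minimal sketch of the Phase-1 body cap, assuming a Lambda-proxy-shaped handler (names illustrative; the real check lives in server/http_handler.py):

```python
# Reject oversized JSON-RPC bodies with 413 before any parsing (sketch).
MAX_BODY_BYTES = 64 * 1024  # 65,536 bytes

def guard_body_size(raw_body: bytes) -> dict | None:
    """Return a 413 response dict, or None when the body is in bounds."""
    if len(raw_body) > MAX_BODY_BYTES:
        return {
            "statusCode": 413,
            "body": '{"error": "request body exceeds 64 KB"}',
        }
    return None  # caller may now json.loads(raw_body)
```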

Phase 2 — GPT-4o tool-chaining + CKAN ergonomics (added 2026-04-30)

Real-world failures observed via the deployed Lambda informed each fix:

  1. GPT-4o chaining (2f3c09f) — fattened tool descriptions with explicit "Next step" hints; suggested_resource_id surfaced at top of search/dataset output; new composite ckan__search_and_query tool that internally chains search_datasets + query_data.
  2. datastore_active filtering (ddcee2e, 668e24f) — Boston datasets attach 5–7 download-only resources (GeoJSON, KML, SHP, PDF) plus a single CSV that's actually loaded into the datastore. Plugin now picks the queryable one and labels each resource [QUERYABLE] vs [DOWNLOAD-ONLY]. Fixes the parks query that 404'd.
  3. Structured where clauses (5825317) — new where argument on query_data and search_and_query supporting eq, ne, gt, gte, lt, lte, in, not_in, like, ilike, is_null via SafeSQLBuilder. Routes through datastore_search_sql when set. Fixes "311 closed on 2026-04-29" (returns 85, the true count).
  4. Schema footer — every successful row response ends with a Filterable columns block listing field names + types so the next pivot is a one-shot.
  5. Sibling resources + resource_name (c1608e2) — Boston's 311 dataset has 22 queryable CSVs (rolling NEW SYSTEM + per-year archives 2011–2026); the model couldn't see them. Sibling block now lists them; new resource_name argument picks one by case-insensitive substring match (resource_name="2020" → 311 SERVICE REQUESTS - 2020).
  6. Limit-truncation guards (d9b6fca, 47fbdca) — model was reading "Found 100 record(s)" as a count when 100 was the LIMIT. CKAN's total (from datastore_search) and count (from package_search) now surfaced. SQL path does a COUNT(*) follow-up when len(records) >= limit. execute_sql extracts the effective LIMIT and warns when truncated. Phrasing switched to "100 of 531 rows returned" everywhere. Fixes "311 closed on 2016-04-29" (now reports 531 instead of the limit-100 misread).
  7. Lint (a8e9fea) — ruff F541.

Why

Same ground rule as Phase 1: don't overwhelm data.boston.gov. Phase 2 layers on a different goal — make the tools usable enough that GPT-4o (which is weaker at multi-step chaining than Claude) can answer real questions without fumbling. Each fix landed in response to a specific observed failure on the deployed prod Lambda.

Test plan

  • Test suite: 212 passing (was 81 pre-Phase-2)
  • Code quality: ruff clean after a8e9fea
  • Live verification on deployed prod Lambda (boston-opencontext-mcp-prod in us-west-2):
    • search_and_query("parks") → Park_Features CSV, real park rows
    • search_and_query("311", where={close_date: {gte: "2026-04-29", lt: "2026-04-30"}, case_status: "Closed"}) → 85 (matches manual SQL count)
    • search_and_query("311", resource_name="2016", where={closed_dt: {gte: "2016-04-29", lt: "2016-04-30"}}) → 100 of 531 rows returned + TRUNCATED block telling the model "the answer is 531, NOT 100"
    • search_datasets("parks", limit=5) → 5 of 21 matching dataset(s) shown

🤖 Generated with Claude Code

brendanbabb and others added 12 commits April 16, 2026 13:01
… pin

Move the deployment to us-west-2, add reserved Lambda concurrency as the
primary brake on fan-out into the upstream CKAN portal, and pin Lambda
packaging to cp311/manylinux wheels so the ZIP works regardless of build
host Python version.

- terraform/aws: add lambda_reserved_concurrency (default 10) wired to
  aws_lambda_function.reserved_concurrent_executions. Extract the S3
  backend config out of main.tf; real backend.tf is gitignored because
  the bucket name embeds the deployer's AWS account ID. Ship
  backend.tf.example as the template.
- prod/staging tfvars: aws_region=us-west-2, api_quota_limit=3000
  (was 1000), lambda_reserved_concurrency=10. Prod custom domain is
  boston-data.codeforanchorage.org; staging has no custom domain.
- scripts/deploy.sh + .github/workflows/release.yml: force cp311
  manylinux wheel resolution on every pip/uv install (without this,
  a Python 3.14 build host produces a ZIP that 502s at Lambda cold
  start). Detect python3/python cross-platform. Build the ZIP with
  stdlib zipfile instead of the `zip` binary so the packaging step
  works on CI images and Windows.
- scripts/setup-backend.sh: fix malformed bucket name
  (boston-opencontext-opendataterraform-state-... → boston-opencontext-
  tfstate-...).
- config.yaml: replace symlink-to-example with a concrete Boston CKAN
  config targeting data.boston.gov. ArcGIS kept disabled for reference.
- local_server.py: accept POSTs on both / and /mcp so the same local
  server works with Claude Desktop stdio bridges and MCP Inspector.
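
A sketch of the stdlib-zipfile packaging step described above (paths illustrative; the shipped logic lives in scripts/deploy.sh and the release workflow):

```python
# Build the Lambda ZIP with stdlib zipfile instead of the `zip` binary,
# so packaging works on CI images and Windows alike. Assumes deps were
# pre-installed into build/lambda with the cp311/manylinux pins
# (pip install --python-version 3.11 --platform manylinux2014_x86_64
#  --only-binary=:all: --target build/lambda ...).
import os
import zipfile

def build_lambda_zip(src_dir: str, out_path: str) -> None:
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in sorted(files):  # stable order across hosts
                full = os.path.join(root, name)
                zf.write(full, os.path.relpath(full, src_dir))

build_lambda_zip("build/lambda", "lambda.zip")  # paths illustrative
```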

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…size

Tighten the two attack surfaces that directly forward user-controlled
input into upstream CKAN: the execute_sql path and the aggregate_data
path. Add a request-body-size cap on the HTTP handler to bound the
work a single JSON-RPC call can cost. See docs/SECURITY.md for the
full threat model.

- plugins/ckan/sql_validator.py
  * SQLValidator: shrink MAX_SQL_LENGTH 50000 → 8192; strip SQL
    comments before keyword/function scans so /* ... */ and -- ...
    obfuscation can't smuggle forbidden tokens past the checks;
    expand FORBIDDEN_KEYWORDS with PREPARE/COPY/LISTEN/NOTIFY/VACUUM/
    ANALYZE/CLUSTER/REINDEX/LOAD/DO; add FORBIDDEN_FUNCTIONS
    (xp_cmdshell, pg_sleep, pg_read_file, pg_ls_dir, pg_stat_file,
    lo_import, lo_export, current_setting, set_config, dblink);
    walk the sqlparse AST to require every FROM/JOIN target to be a
    UUID-quoted resource or a CTE alias (rejects schema-qualified
    targets like pg_catalog.pg_class); match INTO OUTFILE/DUMPFILE.
  * New enforce_row_limit (sketched after this list): appends
    LIMIT 10000 to any validated SQL that lacks a top-level LIMIT so
    a caller can't trigger an unbounded scan on a multi-million-row
    CKAN DataStore table.
  * New SafeSQLBuilder: typed, allowlist-only builder for the
    aggregate_data path. Identifiers must match [A-Za-z_]\w*, metric
    expressions must be count(*) or {count|sum|avg|min|max|stddev}
    ([DISTINCT] <ident>), filter values coerced per type with '
    escaping, order_by parsed and quoted, limit clamped to 10000,
    HAVING values must be numeric.
- plugins/ckan/plugin.py: route aggregate_data through
  SafeSQLBuilder (was string concatenation); call
  SQLValidator.enforce_row_limit after validate_query.
- server/http_handler.py: reject JSON-RPC bodies over 64 KB
  (65,536 bytes) with HTTP 413 before parsing. The MCP surface fits
  in a few KB; a megabyte payload is either a bug or abuse.
- tests: cover body-size cap at and over the boundary, each new
  forbidden keyword/function, comment obfuscation, schema-qualified
  FROM rejection, enforce_row_limit behavior, and every
  SafeSQLBuilder method.
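
A condensed sketch of the comment-stripping and row-limit pieces (sqlparse-based, matching the AST walk described above; exact internals may differ):

```python
# Two validator pieces in miniature: strip comments before scanning,
# and append a LIMIT when none exists at the top level.
import sqlparse
from sqlparse import tokens as T

MAX_ROWS = 10_000

def strip_comments(sql: str) -> str:
    # Normalize away /* ... */ and -- ... before keyword/function
    # scans, so comments can't smuggle forbidden tokens past them.
    return sqlparse.format(sql, strip_comments=True)

def _has_top_level_limit(sql: str) -> bool:
    # Subqueries live inside Parenthesis groups, so scanning only the
    # statement's own tokens ignores their LIMITs.
    stmt = sqlparse.parse(sql)[0]
    return any(tok.ttype is T.Keyword and tok.normalized == "LIMIT"
               for tok in stmt.tokens)

def enforce_row_limit(sql: str) -> str:
    # Append LIMIT 10000 to validated SQL lacking a top-level LIMIT,
    # so no call can trigger an unbounded DataStore scan.
    if _has_top_level_limit(sql):
        return sql
    return f"{sql.rstrip().rstrip(';')} LIMIT {MAX_ROWS}"
```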

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Document the Boston fork's AWS hosting and security posture, with
portal protection as the top design constraint. Add stdio_bridge.py
as a Python alternative to the Go stdio client for Claude Desktop.

- docs/AWS_DEPLOYMENT.md: how this fork is hosted (us-west-2, custom
  domain, reserved concurrency, cp311 packaging), what changed vs.
  upstream's single-region default, and how to operate the stack.
  Leads with the portal-protection design constraint.
- docs/SECURITY.md: the full rationale behind the hardening changes,
  organized around who is being protected — upstream portal first,
  deployment second, end users third. Covers the SQL validator and
  SafeSQLBuilder surface, rate limits and body-size cap, privacy
  posture (stateless, no PII, 14-day log retention, SQL truncated
  to 500 chars), and known gaps.
- README.md: link both new docs from the documentation table.
- stdio_bridge.py: minimal Python stdio-to-HTTP bridge (a sketch
  follows this list). Reads JSON-RPC messages from stdin, POSTs them
  to the local/remote MCP server, writes responses to stdout. Useful
  where the Go client is impractical (Windows, no Go toolchain).
- CLAUDE.md: repo guidance for Claude Code sessions — commands,
  request flow, architecture notes, single-plugin rule.
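
A minimal sketch of such a bridge, assuming newline-delimited JSON-RPC framing and a local endpoint (the shipped stdio_bridge.py may differ):

```python
# Minimal stdio-to-HTTP bridge: read newline-delimited JSON-RPC from
# stdin, POST each message to the MCP server, echo responses to stdout.
import sys
import urllib.request

SERVER_URL = "http://localhost:8000/mcp"  # assumption: local dev server

def main() -> None:
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        req = urllib.request.Request(
            SERVER_URL,
            data=line.encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            body = resp.read().decode("utf-8").strip()
        if body:  # JSON-RPC notifications may get empty replies
            sys.stdout.write(body + "\n")
            sys.stdout.flush()

if __name__ == "__main__":
    main()
```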

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Adds Method 4 to docs/TESTING.md covering how to wire the local HTTP
server to Claude Desktop (claude_desktop_config.json) and Claude Code
(.mcp.json) via stdio_bridge.py. The bridge was previously only mentioned
in CLAUDE.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ce_id, search_and_query)

GPT-4o struggles to extract a resource UUID from one tool result and pass
it to the next call. This patch makes the CKAN plugin friendlier to weaker
multi-step tool callers without changing existing tool semantics:

- Tool descriptions now end with explicit "Next step" guidance naming the
  follow-up tool (ckan__query_data, ckan__get_dataset, etc.).
- Per-parameter descriptions spell out provenance ("the `id` field inside
  the `resources` array returned by ckan__search_datasets").
- search_datasets / get_dataset formatted output leads with a NEXT STEP
  block exposing suggested_resource_id and a literal suggested_call line.
- New composite tool ckan__search_and_query that chains search + query
  server-side, eliminating the cross-call ID-passing problem entirely.

Tests: tool count 6 -> 7, three new cases for search_and_query happy path,
no-match, and dataset-with-no-resources. Full CKAN suite passes (32/32).
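
The composite's control flow, as a hypothetical sketch (pick_resource_id and the error shape are illustrative):

```python
# Hypothetical sketch of the server-side chain: search, pick a
# resource id, query; the model never copies a UUID between calls.
async def search_and_query(query: str, limit: int = 100) -> str:
    datasets = await search_datasets(query)            # step 1: search
    resource_id = pick_resource_id(datasets)           # step 2: pick
    if resource_id is None:
        return f"No queryable resource found for {query!r}"
    return await query_data(resource_id, limit=limit)  # step 3: query
```
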
Boston attaches 5-7 resources to each dataset (GeoJSON, KML, SHP, PDF,
ArcGIS REST, CSV), but only the CSV is loaded into the queryable
Postgres datastore. The previous code blindly handed the model
`resources[0].id`,
which is the GeoJSON one for most parks/permits/etc. datasets, and the
caller then got a 404 from datastore_search.

Fixes:
- New _is_queryable / _first_queryable_resource helpers.
- search_datasets formatter: suggested_resource_id and per-dataset
  resource_id lines now point at the datastore_active resource only.
  When no resource is queryable, say so explicitly instead of suggesting
  a doomed UUID.
- get_dataset formatter: NEXT STEP block points at the queryable
  resource; resource list labels each as [QUERYABLE] or [DOWNLOAD-ONLY]
  and surfaces the download URL for non-queryable ones.
- search_and_query: walks search results and resources to find one with
  datastore_active=true (configurable via dataset_index/resource_index).
  Falls through to the next dataset rather than 404-ing on a download-
  only first match.
- query_data: 404s now append a hint about the datastore_active gotcha
  pointing at get_dataset / search_and_query so the model doesn't keep
  retrying the same UUID.

Tool descriptions updated to call out the queryable-vs-download
distinction.
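
The helpers can be as small as this sketch (datastore_active is CKAN's flag for DataStore-loaded resources; exact code may differ):

```python
# A resource is queryable only when it's loaded into the DataStore;
# everything else (GeoJSON, KML, SHP, PDF, ...) is download-only and
# 404s on datastore_search.
def _is_queryable(resource: dict) -> bool:
    return bool(resource.get("datastore_active"))

def _first_queryable_resource(dataset: dict) -> dict | None:
    for resource in dataset.get("resources", []):
        if _is_queryable(resource):
            return resource
    return None  # caller says so explicitly instead of guessing a UUID
```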

Tests: 35 passing (was 32). Three new regressions:
- parks-style multi-resource shape skips GeoJSON and queries CSV
- walks to the next dataset when first has no queryable resource
- query_data 404 includes the datastore_active hint

Live smoke test against data.boston.gov confirms ckan__search_and_query
"parks" now returns real Park_Features rows (Iacono/Readville Playground,
East Boston Memorial Park, etc.) instead of 404.

Real failure observed: model asked for "311 service requests closed on
4/29", called ckan__search_and_query without filters, and got 10
unfiltered rows (almost none closed on 4/29; 85 actually were). Two
design problems:
  1) `query` matches dataset metadata, not row content — but the model
     read it as a row filter.
  2) `filters` is equality-only against datastore_search, so it can't
     express "close_date in [4/29, 4/30)" even if used.

Fixes (A + B + C from the proposal):

A) Tool descriptions explicitly distinguish dataset-metadata search
   from row filtering and steer to the right knob (filters / where /
   execute_sql) with concrete examples.

B) Every successful query_data / search_and_query response now ends
   with a "Filterable columns" footer listing field names + types,
   pulled from the datastore response. The next pivot to where /
   execute_sql becomes a one-shot.

C) New structured `where` argument on query_data and search_and_query.
   Routes through datastore_search_sql via SafeSQLBuilder; supports
   eq, ne, gt, gte, lt, lte, in, not_in, like, ilike, is_null. Strings
   are length-capped (256) and quote-escaped; IN-lists capped (100);
   identifiers validated as before.
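
A condensed sketch of the operator mapping (the shipped version routes through SafeSQLBuilder and coerces values per column type; this sketch quotes everything as text):

```python
# Translate one {column: {op: value}} entry into a SQL condition.
import re

OPS = {"eq": "=", "ne": "<>", "gt": ">", "gte": ">=",
       "lt": "<", "lte": "<="}

def _quote(value) -> str:
    text = str(value)
    if len(text) > 256:
        raise ValueError("string filter value too long")
    return "'" + text.replace("'", "''") + "'"  # escape single quotes

def build_condition(column: str, op: str, value) -> str:
    if not re.fullmatch(r"[A-Za-z_]\w*", column):
        raise ValueError(f"bad identifier: {column!r}")
    if op in OPS:
        return f'"{column}" {OPS[op]} {_quote(value)}'
    if op in ("in", "not_in"):
        if len(value) > 100:
            raise ValueError("IN list too long")
        items = ", ".join(_quote(v) for v in value)
        negate = "NOT " if op == "not_in" else ""
        return f'"{column}" {negate}IN ({items})'
    if op in ("like", "ilike"):
        return f'"{column}" {op.upper()} {_quote(value)}'
    if op == "is_null":  # assumed semantics: truthy -> IS NULL
        return f'"{column}" IS {"NULL" if value else "NOT NULL"}'
    raise ValueError(f"unknown operator: {op!r}")

# {"close_date": {"gte": "2026-04-29", "lt": "2026-04-30"}} renders as
# "close_date" >= '2026-04-29' AND "close_date" < '2026-04-30'
```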

Live verification: search_and_query("311", where={close_date: {gte:
"2026-04-29", lt: "2026-04-30"}, case_status: "Closed"}) now returns
exactly 85 rows — matching ground truth from a manual SQL count.

Tests: +26 new (170 -> 196). Coverage:
- where-clause builder for every operator + edge case (injection
  escaping, bad ops/types, oversized lists/strings, unknown operators)
- query_data routes through datastore_search_sql when `where` set
- query_data validation errors short-circuit before any API call
- schema footer surfaces in both SQL and non-SQL paths

Boston's 311 dataset attaches 22 queryable CSV resources (a rolling
'NEW SYSTEM' view plus per-year archives 2011–2026). The model only ever
saw whichever the auto-pick chose first — typically the rolling view —
which silently dropped older data on the floor for historical questions.

Two improvements:

(1) `_format_search_and_query` now emits an "Other queryable resources
in this dataset" block listing every datastore_active resource other
than the chosen one (name, format, resource_id), so the model can see
that '311 - 2020', '311 - 2019', etc. exist.

(2) New `resource_name` argument on search_and_query: case-insensitive
substring match against resource `name`. Takes precedence over
`resource_index`. With dataset_index pinned, a no-match returns a clean
error listing the available names.

Selection precedence (high → low):
  resource_name > resource_index > first datastore_active resource
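
In code, that precedence might read (sketch; error shape illustrative):

```python
# Selection sketch: resource_name beats resource_index beats the
# first datastore_active resource.
def pick_resource(resources: list[dict],
                  resource_name: str | None = None,
                  resource_index: int | None = None) -> dict | None:
    queryable = [r for r in resources if r.get("datastore_active")]
    if resource_name is not None:
        needle = resource_name.lower()
        matches = [r for r in queryable
                   if needle in (r.get("name") or "").lower()]
        if not matches:
            names = ", ".join(r.get("name", "?") for r in queryable)
            raise LookupError(
                f"no resource matches {resource_name!r}; available: {names}")
        return matches[0]
    if resource_index is not None:
        return queryable[resource_index]
    return queryable[0] if queryable else None  # first datastore_active
```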

Tool description updated with concrete example
(resource_name='2020' picks the 2020 archive). The 'Other queryable
resources' block tells the model these names exist; the description
tells it how to use them.

Live verification: search_and_query("311", resource_name="2020") now
queries 311 SERVICE REQUESTS - 2020 (resource 6ff6a6fd-...) and returns
2020 records. Default no-resource_name query still picks the rolling
NEW SYSTEM and surfaces all 21 archive siblings in the response.

Tests: +3 (38 -> 41) covering resource_name match, no-match error, and
the siblings block (queryable-only, excludes download-only resources).

Real-world failure: bot answered "How many 311 requests closed on
4/29/2016?" with 100 — but 100 was the LIMIT; the true count is 531.
The model was reading "Found 100 record(s) (showing up to 100)" as
the answer to a counting question.

Fixes:

1. Capture CKAN's `total` from datastore_search responses (it's
   already returned by CKAN; we were discarding it). Plumbed through
   _query_with_schema → ToolResult → formatter.

2. SQL (`where`) path: when SELECT * hits the limit exactly we don't
   know the true total, so issue a cheap SELECT COUNT(*) follow-up
   with the same WHERE. This means we always tell the model a real
   number when it asks a counting question via search_and_query or
   query_data, regardless of which path was taken.

3. Formatter rewrite:
   - Header line is now `total_matching_rows: N (returned K, limit=L)`
     when total is known and exceeds returned. Removed the misleading
     "Found 100 record(s)" wording that conflated rows-returned with
     true count.
   - When truncation is detected, prepend a `=== TRUNCATED ===` block
     stating "the answer is N, NOT K" and pointing at
     ckan__aggregate_data for cheap counts and ckan__execute_sql for
     custom LIMIT/ORDER BY.
   - When total can't be determined (count follow-up failed) and rows
     == limit, prepend `=== MAY BE TRUNCATED ===`.
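
A sketch of the COUNT(*) follow-up from item 2 (datastore_search_sql stands in for whatever async CKAN wrapper the plugin uses):

```python
# When SELECT * fills the limit exactly, the true total is unknown,
# so fire a cheap COUNT(*) with the same WHERE clause.
async def true_total(resource_id: str, where_sql: str,
                     records: list, limit: int) -> int | None:
    if len(records) < limit:
        return len(records)  # full result; no follow-up needed
    count_sql = (f'SELECT COUNT(*) AS n FROM "{resource_id}" '
                 f"WHERE {where_sql}")
    try:
        rows = await datastore_search_sql(count_sql)
        return int(rows[0]["n"])
    except Exception:
        return None  # formatter falls back to MAY BE TRUNCATED
```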

Live verification (boston prod CKAN):
- search_and_query "311 closed on 2016-04-29" with default limit=100
  → returns the 100 sample rows AND header "total_matching_rows: 531"
  AND the TRUNCATED warning telling the model the answer is 531.
- search_and_query "311 closed on 2026-04-29" with limit=100
  → returns 85 rows, no TRUNCATED warning (85 < 100), header reads
  "total_matching_rows: 85".

Tests: +5 (41 -> 46) covering total surfaced from datastore_search,
no-warning path when under limit, COUNT(*) follow-up fires only when
truncated on the SQL path, and graceful fallback when count fails.

User flagged: limit-as-count confusion isn't unique to query_data.
Same trap exists in execute_sql ("returned 100 because the SQL said
LIMIT 100") and in search_datasets ("found 20 datasets" while CKAN
knows there are actually 47 matches, since package_search returns
`count`). Switched to the user-suggested 'X of Y' phrasing throughout.

Changes:

1. Shared _format_count_header helper renders three cases:
   - "100 of 531 rows returned (limit=100; raise limit or use
     ckan__aggregate_data for full count)."
   - "85 rows returned (full result, limit=100)."
   - "100 rows returned (limit=100, true total unknown — see
     TRUNCATED warning above if any)."
   Used in query_data, search_and_query, and search_datasets
   (with unit="matching dataset(s) shown").

2. search_datasets now reads CKAN's `count` from package_search and
   surfaces it: "5 of 21 matching dataset(s) shown" instead of
   "Found 5 dataset(s)".

3. execute_sql now extracts the effective LIMIT from the validated
   SQL via SQLValidator.extract_top_level_limit() and emits a
   "MAY BE TRUNCATED" block when len(records) >= effective_limit
   (datastore_search_sql doesn't return a `total`, so this heuristic
   is the best we can do). aggregate_data inherits the same guard
   since it formats results via _format_sql_results.

4. Existing query_data/search_and_query header lines switched to the
   X-of-Y phrasing for consistency. The TRUNCATED warning text now
   matches the rest of the corpus.
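
The shared helper might look like this sketch of the three renderings above (signature illustrative):

```python
def _format_count_header(returned: int, total: int | None, limit: int,
                         unit: str = "rows returned") -> str:
    # Case 1: true total known and larger than what was returned.
    if total is not None and total > returned:
        return (f"{returned} of {total} {unit} (limit={limit}; raise "
                f"limit or use ckan__aggregate_data for full count).")
    # Case 2: full result fits under the limit.
    if total is not None:
        return f"{returned} {unit} (full result, limit={limit})."
    # Case 3: total unknowable (e.g. datastore_search_sql path).
    return (f"{returned} {unit} (limit={limit}, true total unknown — "
            f"see TRUNCATED warning above if any).")
```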

Live verification (boston prod CKAN):
- search_datasets("parks", limit=5) → "5 of 21 matching dataset(s)
  shown (limit=5; raise limit to see more)".
- execute_sql with LIMIT 100 returning 100 rows → MAY BE TRUNCATED
  block at top, "100 rows returned (limit=100, true total
  unknown...)".

Tests: +8 (204 -> 212). New: search_datasets count rendering, two
execute_sql truncation cases, four extract_top_level_limit unit tests
(simple, semicolon, missing, subquery-ignored), enforce-then-extract
round trip.
brendanbabb merged commit e9479ca into main on Apr 30, 2026
2 checks passed